Statistical Outlier Detection in Large Multivariate Datasets

Authors

  • Pradipto Das
  • Deba Prasad Mandal
Abstract

This work addresses the detection of outliers in large and very large datasets using a computationally efficient procedure. The algorithm applies Tukey's biweight function to the dataset to filter out the effects of extreme values and so obtain appropriate location and scale estimates. These estimates are then used to calculate robust Mahalanobis distances for all data points. A suitable rejection point for the outliers is given by a separation boundary obtained through non-parametric Parzen-window density estimation: the boundary lies where the probability density curve of the robust Mahalanobis distances descends and then ascends again for outlying distances. The procedure identifies outliers successfully even when the data are highly skewed and overlapping, and it compares favourably with established statistical outlier detection methods for univariate and multivariate data, which require the underlying distribution to be known.
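The steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the biweight cutoff `c`, the iteration count, the MAD-based starting scatter, the Gaussian kernel, and the normal-reference bandwidth rule are all choices made here for the example. The abstract does not specify any of them.

```python
import numpy as np

def mahalanobis(X, mu, cov):
    """Mahalanobis distance of every row of X from mu under scatter cov."""
    diff = X - mu
    return np.sqrt(np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff))

def biweight_estimates(X, c=3.0, n_iter=20):
    """Location and scatter via Tukey-biweight reweighting (illustrative only).

    c and n_iter are assumed tuning values, not taken from the paper.
    """
    mu = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - mu), axis=0)
    cov = np.diag(mad ** 2)                      # assumed starting scatter
    for _ in range(n_iter):
        d = mahalanobis(X, mu, cov)
        u = d / (c * np.median(d))               # distance relative to a typical point
        w = np.where(u < 1.0, (1.0 - u ** 2) ** 2, 0.0)  # biweight: zero past the cutoff
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu
        cov = (w[:, None] * diff).T @ diff / w.sum()
    return mu, cov

def parzen_density(d, grid, h):
    """Parzen-window density estimate of d on grid, Gaussian kernel of width h."""
    z = (grid[:, None] - d[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (d.size * h * np.sqrt(2 * np.pi))

def rejection_point(d, n_grid=512):
    """First valley of the density curve to the right of its highest peak."""
    h = 1.06 * d.std() * d.size ** (-0.2)        # normal-reference bandwidth (assumed)
    grid = np.linspace(0.0, d.max() + 3 * h, n_grid)
    dens = parzen_density(d, grid, h)
    peak = np.argmax(dens)
    is_min = (dens[1:-1] < dens[:-2]) & (dens[1:-1] <= dens[2:])
    valleys = np.flatnonzero(is_min) + 1
    valleys = valleys[valleys > peak]
    # No valley means the curve never ascends again: flag nothing.
    return grid[valleys[0]] if valleys.size else np.inf

def detect_outliers(X):
    """Return a boolean outlier mask and the rejection point used."""
    mu, cov = biweight_estimates(X)
    d = mahalanobis(X, mu, cov)
    threshold = rejection_point(d)
    return d > threshold, threshold
```

Calling `detect_outliers(X)` on an `(n, p)` array returns a mask of flagged rows together with the rejection point. The density step here builds an `n_grid × n` matrix, so a genuinely large dataset would likely need a binned or subsampled estimate instead.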

Similar resources

Detecting outliers in high-dimensional neuroimaging datasets with robust covariance estimators

Medical imaging datasets often contain deviant observations, the so-called outliers, due to acquisition or preprocessing artifacts or resulting from large intrinsic inter-subject variability. These can undermine the statistical procedures used in group studies as the latter assume that the cohorts are composed of homogeneous samples with anatomical or functional features clustered around a cent...

An Empirical Comparison of Outlier Detection Methods

Four outlier detection methods are compared using both publicly available smaller statistical datasets and real-life Knowledge Discovery in Databases (KDD) datasets [1]. The smaller datasets provide insight (via visualisations) into the relative strengths and weaknesses of the compared methods. The real-life large datasets test scalability and practicality of application. We are unaware of prev...

Identification of outliers types in multivariate time series using genetic algorithm

Multivariate time series data are often modeled using the vector autoregressive moving average (VARMA) model. However, the presence of outliers can violate the stationarity assumption and may lead to incorrect modeling, biased parameter estimates, and inaccurate predictions. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...

Z-Glyph: Visualizing outliers in multivariate data

Outlier analysis techniques are extensively used in many domains such as intrusion detection. Today, even with the most advanced statistical learning techniques, human judgment still plays an important role in outlier analysis tasks due to the difficulty of defining and collecting outlier examples. This work seeks to tackle this problem by introducing a new visualization design, “Z-Glyph,” a ...

Multivariate Outlier Detection Using Independent Component Analysis

Recent developments consider a rather unexpected application of the theory of independent component analysis (ICA) to outlier detection, data clustering, and multivariate data visualization. Accurate identification of outliers plays an important role in statistical analysis. If classical statistical models are blindly applied to data containing outliers, the results can be ...


Publication date: 2005